Abstract To classify time series by nearest neighbors, we need to specify or
learn one or several distance measures. We consider variations of the
Mahalanobis distance measures which rely on the inverse covariance matrix of
the data. Unfortunately, for time series data, the covariance matrix often has
low rank. To alleviate this problem we can either use a pseudoinverse,
covariance shrinking or limit the matrix to its diagonal. We review these
alternatives and benchmark them against competitive methods such as the related
Large Margin Nearest Neighbor Classification (LMNN) and the Dynamic Time
Warping (DTW) distance. As we expected, we find that the DTW is superior,
but the Mahalanobis distance measures are one to two orders of magnitude
faster. To get the best results with Mahalanobis distance measures, we recommend
learning one distance measure per class using either covariance shrinking or
the diagonal approach.

Keywords Time-series classification • Distance measure learning • Nearest 
Neighbor • Mahalanobis distance measure 
1 Introduction

Time series are sequences of values measured over time. Examples include 
financial data, such as stock prices, or medical data, such as blood sugar levels. 
Classifying time series is an important class of problems which is applicable to
music classification (Weihs et al 2007), medical diagnosis (Sternickel 2002)
or bioinformatics (Legrand et al 2008).



Nearest Neighbor (NN) methods classify time series efficiently and accurately
(Ding et al 2008). In the 1-NN method, we let the unclassified instance
be in the same class as its nearest classified neighbor.

We need to specify a distance measure: the Euclidean and Dynamic Time
Warping (Sakoe and Chiba 1978a) distances are popular choices. However, we
can also learn a distance measure based on some training data (Weinberger and
Saul 2009; Yang and Jin 2006). Given the training data set made of classes of
time series instances, we can either learn a single (global) distance measure, or
learn one distance measure per class (Csatari and Prekopcsak 2010; Paredes
and Vidal 2000, 2006). That is, to compute the distance between a test element
and an instance of class j, we use a distance measure specific to class j.

Consider a family of time series $x^{(1)}, x^{(2)}, \ldots$ of length n. Because
the Euclidean distance is popular for NN classification, it is tempting to
consider generalized ellipsoid distance measures (Ishikawa et al 1998), that is,
distance measures of the form

    $D_M(x, y) = (x - y)^T M (x - y)$

where M is a positive semi-definite matrix and x, y are two time series of
length n. When the matrix M is the identity matrix, we recover the (squared)
Euclidean distance. We get the Mahalanobis distance measure when we use
the matrix M minimizing the sum of distances between the time series in
S: $\sum_{x, y \in S} D_M(x, y)$ (see §3). Unfortunately, in the context of time series,
solving for such an optimal matrix often involves inverting a low-rank matrix.
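
As an illustration of the generalized ellipsoid distance measure, here is a minimal Python sketch. The function name `ellipsoid_distance` and the list-of-lists representation of M are our own illustrative choices, not part of the experimental code.

```python
from typing import Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def ellipsoid_distance(x: Vector, y: Vector, m: Matrix) -> float:
    """Return D_M(x, y) = (x - y)^T M (x - y) for two series of equal length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[i] * m[i][j] * d[j] for i in range(n) for j in range(n))


# With M equal to the identity matrix we recover the squared Euclidean distance.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
squared_euclidean = ellipsoid_distance([1.0, 2.0, 3.0], [0.0, 0.0, 1.0], identity)
```

The double loop costs O(n^2) operations for a full matrix M, which is one reason a diagonal M is attractive for long time series.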

Our main contribution is to survey and compare techniques to solve this 
mathematical difficulty: 

— We may require M to be a diagonal matrix. 

— We can use a pseudoinverse. 

— We can apply covariance shrinkage. 

Moreover, we can either learn one such distance measure for the entire data 
set, or one distance measure per class. To our knowledge, there was no attempt 
to compare these alternatives in the context of time series. After comparing 
these alternatives, we present two main findings: 

— We get significantly poorer classification accuracy when using pseudoin- 
verses. Indeed, the pseudoinverse approach generates twice the error rate 
of the covariance shrinkage or diagonal-matrix approach. 

— We find that the class-specific Mahalanobis distance measures are prefer- 
able to the global Mahalanobis distance measure. That is, it is best to learn 
one distance measure per class instead of learning one overall distance mea- 
sure. 

We also compare our results with other well established techniques such as 
Large Margin Nearest Neighbor Classification (LMNN) and the Dynamic Time 
Warping (DTW) distance. We find that even though the DTW has superior 
classification accuracy, it is one to two orders of magnitude slower than
Mahalanobis distance measures.



2 Related Works 



Consider two time series x and y of length n. The i-th data point of time series
x is written $x_i$. Two of the most common distances between time series are
the Manhattan and Euclidean distances. They are special cases (p = 1 and
p = 2) of the Minkowski distance: $\sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}$. Several other distance
measures are used for time series classification. Ding et al (2008) presented
an extensive comparison of these distance measures and concluded that DTW
is among the best measures and that the accuracy of the Euclidean distance
converges to that of DTW as the size of the training set increases.



In a general Machine Learning setting, Paredes and Vidal (2000, 2006)
compared the Euclidean distance with the conventional and class-specific
Mahalanobis distance measures. One of our contributions is to validate these
generic results on time series: instead of tens of features, we have hundreds or
even thousands of values, which makes the problem mathematically more
challenging: the ranks of our covariance matrices are often tiny compared to
their sizes.



More generally, distance metric learning has an extensive literature (Chai
et al 2010; Hastie and Tibshirani 1996; Short and Fukunaga 1980; Wettschereck
et al 1997). We refer the reader to Weinberger and Saul (2009) for a review.



There are many extensions and alternatives to NN classification. For
example, Jahromi et al (2009) use instance weights to improve classification.
Meanwhile, Zhan et al (2009) learn a distance measure per instance. More
generally, the problem of classifying time series has a long history in
statistics (Fisher 1936; Hastie and Tibshirani 1996; R.H. and Shumway 1982).



2.1 Dynamic Time Warping Distance 



The Dynamic Time Warping distance (DTW) is a generalization of the Minkowski
distance which allows the data to be realigned (Itakura 1975; Sakoe and
Chiba 1978b). To compute the DTW between x and y, you must find a
many-to-many matching between the data points in x and the data points
in y. That is, each data point from one series must be matched with at least
one data point from the other series. One such matching is the trivial one,
which maps the first data point from x to the first data point in y, the
second data point in x to the second data point in y, and so on. A matching
can be written as a list of pairs of indexes with one index in the first time
series and one index in the other. For example, the trivial matching is just
$\Gamma = \{(1, 1), (2, 2), \ldots, (n, n)\}$. The Minkowski distance corresponding to a
matching $\Gamma$ is defined as $\sqrt[p]{\sum_{(i,j) \in \Gamma} |x_i - y_j|^p}$. Typically, p is either 1 or 2: for
our purposes we choose p = 2.
For a given p, the DTW is defined as the minimal Minkowski distance over
all allowed matchings $\Gamma$. That is, $\mathrm{DTW}(x, y) = \min_{\Gamma} \sqrt[p]{\sum_{(i,j) \in \Gamma} |x_i - y_j|^p}$.

We can solve for $\Gamma$ using dynamic programming. Matchings are required
to be monotonic: if both $(i, j)$ and $(i + 1, j')$ are in $\Gamma$ then $j' \geq j$, that is,
we cannot warp back in time: if the first index increases, the second index
cannot decrease. Because of monotonicity, the DTW is not invariant under
permutation of the coordinates. The DTW between (0, 1, 0, 2) and (0, 1, 1, 2) is
one with $\Gamma_1 = \{(1, 1), (2, 2), (3, 3), (4, 4)\}$ whereas the DTW between (0, 0, 1, 2)
and (0, 1, 1, 2) is zero with $\Gamma_2 = \{(1, 1), (2, 1), (3, 2), (3, 3), (4, 4)\}$. Yet
(0, 1, 0, 2) and (0, 0, 1, 2) only differ by the permutation of the second and
third data points.
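
The dynamic programming computation can be sketched as follows in Python. This is a minimal quadratic-time illustration without any warping constraint; the function name `dtw` and its signature are illustrative assumptions rather than the code used in our experiments.

```python
from typing import Sequence


def dtw(x: Sequence[float], y: Sequence[float], p: int = 2) -> float:
    """Minimal Minkowski cost over all monotonic many-to-many matchings."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] is the best sum of |x_a - y_b|^p when matching the first i
    # points of x with the first j points of y.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(x[i - 1] - y[j - 1]) ** p
            # Monotonicity: a pair can only extend a matching that ends at
            # (i-1, j), (i, j-1) or (i-1, j-1).
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] ** (1.0 / p)


# The two examples from the text: the series differ only by a permutation,
# yet their DTW distances to (0, 1, 1, 2) differ.
first = dtw([0, 1, 0, 2], [0, 1, 1, 2])   # 1.0
second = dtw([0, 0, 1, 2], [0, 1, 1, 2])  # 0.0
```

The table has (n + 1)(m + 1) entries, so the running time is quadratic in the length of the time series, and the two series need not have the same length.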

Unlike many other distance measures, such as the Euclidean distance,
the DTW can handle sequences of different lengths. However, according to
Ratanamahatana and Keogh (2005), "comparing sequences of different lengths
and reinterpolating them to equal length produce no statistically significant
difference in accuracy or precision/recall." In other words, when comparing
time series having different lengths, we may linearly interpolate them to have
the same length without loss of classification accuracy.

As an extension, some matches might be forbidden if the data points are
too far apart (Itakura 1975; Sakoe and Chiba 1978b). Yu et al (2011) have
proposed learning this warping constraint from the data. Gaudin and Nicoloyannis
(2006) proposed a weighted version of the DTW called Adaptable Time Warping.
Instead of computing $\sum_{(i,j) \in \Gamma} |x_i - y_j|^p$, it computes
$\sum_{(i,j) \in \Gamma} M_{i,j} |x_i - y_j|^p$ where M is some matrix. Unfortunately, finding the optimal matrix M can
be a challenge. Jeong et al (2011) investigated another form of weighted DTW
where you seek to minimize the cost $\sqrt[p]{\sum_{(i,j) \in \Gamma} w_{|i-j|} |x_i - y_j|^p}$ where w is
some weight vector. Many other variations on the DTW have been proposed,
e.g., Chouakria and Nagabhushan (2007).

One disadvantage is that the DTW fails to satisfy the triangle inequality
($\mathrm{DTW}(x, y) + \mathrm{DTW}(y, z) \geq \mathrm{DTW}(x, z)$), hence the DTW is not a metric
(Lemire 2009).



2.2 Large Margin Nearest Neighbor (LMNN) 



A conventional distance-learning approach is to find an optimal generalized
ellipsoid distance measure with respect to a specific loss function. The LMNN
algorithm proposed by Weinberger and Saul (2009) takes a different approach.
It seeks to force nearest neighbors to belong to the same class and it separates
instances from different classes by a large margin. LMNN can be formulated
as a semi-definite programming problem (Vandenberghe and Boyd 1996).



Specifically, we begin with a generalized ellipsoid distance measure $D_M(x, y) =
(x - y)^T M (x - y)$. We must solve for the matrix M given some data set of
classified time series $x^{(1)}, x^{(2)}, \ldots, x^{(N)}$. We require M to be positive semi-definite,
so that the square root of the distance measure $D_M$ is a pseudo-metric: it is
symmetric, non-negative and it satisfies the triangle inequality
($\sqrt{D_M(x, y)} + \sqrt{D_M(y, z)} \geq \sqrt{D_M(x, z)}$).

Prior to computing M, we create two N × N matrices y and $\eta$. We set $y_{ij}$
equal to one whenever $x^{(i)}$ and $x^{(j)}$ are in the same class, otherwise $y_{ij} = 0$. For
all time series $x^{(i)}$, we find k nearest neighbors under the Euclidean distance
that are in the same class. (For 1-NN classification, we set k = 1.) Whenever
$x^{(j)}$ is a nearest neighbor of $x^{(i)}$, we set $\eta_{ij} = 1$, otherwise $\eta_{ij} = 0$. Both
matrices are computed once.
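
The construction of the two indicator matrices can be sketched as follows. This is a minimal Python illustration: the function name `lmnn_indicator_matrices` is hypothetical, and we assume the convention that entry (i, j) of the second matrix is one when series j is among the k nearest same-class neighbors of series i.

```python
from typing import Hashable, List, Sequence, Tuple


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b))


def lmnn_indicator_matrices(
    series: Sequence[Sequence[float]], labels: Sequence[Hashable], k: int = 1
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (y, eta): same-class indicators and same-class k-NN indicators."""
    n = len(series)
    y = [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]
    eta = [[0] * n for _ in range(n)]
    for i in range(n):
        # Candidates are the other series of the same class, nearest first.
        same = [j for j in range(n) if j != i and labels[j] == labels[i]]
        same.sort(key=lambda j: squared_euclidean(series[i], series[j]))
        for j in same[:k]:
            eta[i][j] = 1
    return y, eta


series = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0], [0.0, 3.0]]
labels = ["a", "a", "b", "b", "a"]
y, eta = lmnn_indicator_matrices(series, labels, k=1)
```

Both matrices depend only on the labels and on Euclidean distances, which is why they can be computed once before solving for M.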



Weinberger and Saul (2009) find the positive semi-definite matrix M by
minimizing

    $\sum_{i,j} \eta_{ij} D_M(x^{(i)}, x^{(j)}) + c \sum_{i,j,l} \eta_{ij} (1 - y_{il}) \epsilon_{ijl}$

where the sums are over the range of indices {1, 2, ..., N}, subject to the
constraints that the $\epsilon_{ijl}$'s are non-negative and that

    $D_M(x^{(i)}, x^{(l)}) - D_M(x^{(i)}, x^{(j)}) \geq 1 - \epsilon_{ijl}$.

The fixed parameter c is set by cross validation. Weinberger and Saul (2009)
called the variables $\epsilon_{ijl}$ slack variables: they must be determined along with
the matrix M. Though this problem can be solved using a generic solver,
Weinberger and Saul (2009) found that they could get substantially better
speed with a custom solver: we use their software in §5.



3 Mahalanobis distance measures 

Given a time series $x^{(k)}$, we write its i-th data point as $x_i^{(k)}$. We compute the
(sample) covariance matrix $C = (c_{ij})$ of a family of time series $x^{(1)}, x^{(2)}, \ldots, x^{(N)}$
of length n by $c_{ij} = \frac{1}{N-1} \sum_{k=1}^{N} (x_i^{(k)} - \bar{x}_i)(x_j^{(k)} - \bar{x}_j)$ where N is the number
of instances and where $\bar{x}_i$ is the average of the i-th data point of the time series.
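
For concreteness, the sample covariance can be computed as in the following Python sketch. The function name `sample_covariance` and the toy data are illustrative assumptions; the example is chosen so that the resulting matrix is singular.

```python
from typing import List, Sequence


def sample_covariance(series: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the n x n sample covariance of N time series of length n."""
    big_n = len(series)
    if big_n < 2:
        raise ValueError("at least two time series are required")
    n = len(series[0])
    means = [sum(s[i] for s in series) / big_n for i in range(n)]
    return [
        [
            sum((s[i] - means[i]) * (s[j] - means[j]) for s in series) / (big_n - 1)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three series of length two whose second point is always twice the first:
# the covariance matrix has rank one and cannot be inverted.
cov = sample_covariance([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
determinant = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
```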



The Mahalanobis distance measure (Mahalanobis 1936) is a special case of
the generalized ellipsoid distance measure $D_M(x, y) = (x - y)^T M (x - y)$ where
M is proportional to the inverse of the covariance matrix ($M \propto C^{-1}$). Though
the Mahalanobis distance measure is often defined by setting M to the inverse
of the covariance matrix ($M = C^{-1}$), we find it convenient to normalize it when
possible so that the determinant of the matrix M is one: $M = (\det(C))^{1/n} C^{-1}$
where n is the length of the time series. The Mahalanobis distance measure
minimizes the sum of distances between time series $\sum_{x, y} D_M(x, y)$ subject to
a regularization constraint on the determinant ($\det(M) = 1$). In this sense, it
is optimal.
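
As a quick check of the normalization, assuming that C is an invertible n × n matrix, the determinant of the normalized matrix is indeed one, because a scalar factor a multiplies the determinant of an n × n matrix by a^n:

```latex
\det(M) = \det\left( (\det C)^{1/n} \, C^{-1} \right)
        = \left( (\det C)^{1/n} \right)^{n} \det\left( C^{-1} \right)
        = \frac{\det C}{\det C}
        = 1 .
```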

When the covariance is non-singular ($\det(C) \neq 0$) then the covariance
is positive definite, and so is the matrix M: it follows that the square root
of the generalized ellipsoid distance measure is a metric. That is, we have
$D_M(x, y) = 0 \Leftrightarrow x = y$, it is symmetric, non-negative and it satisfies the
triangle inequality ($\sqrt{D_M(x, y)} + \sqrt{D_M(y, z)} \geq \sqrt{D_M(x, z)}$). Unfortunately,
the covariance matrix fails to be invertible when the number of instances (N)
is smaller than the number of data points (n). In §4 we review some
solutions to address this problem.



4 Computing Mahalanobis distance measures for time series 
classification 

The covariance matrix may be singular when the number of instances (N) is
smaller than or about the same as the number of data points (n) in the time series.
This is a common problem with time series: whereas individual time series 
might have thousands of data points, there may only be a few labeled time 
series in each class. 



4.1 Diagonal Mahalanobis distance measures 

The most straightforward solution is to limit the covariance matrix C to its
diagonal, thus producing a weighted Euclidean distance measure. Indeed, if
we require that the matrix M be zero outside the diagonal, then restricting the
covariance C to its diagonal (that is, setting $M \propto \mathrm{diag}(C)^{-1}$) minimizes the
sum of distances between time series. As long as the variance of each data point
in our training sets is different from zero, a condition satisfied in practice
in our experiments, the problem is well posed and the result is a
positive-definite matrix. Hence, the square root of the generalized ellipsoid
distance measure $D_M(x, y) = (x - y)^T M (x - y)$ is a metric. We normalize M so
that its determinant is one.
In such a diagonal case, the number of parameters to learn grows only linearly
with the number of data points in the time series. In contrast, the number of
elements in the full covariance matrix grows quadratically. One consequence
is that the diagonal version of the Mahalanobis distance measure is computed
much faster ($O(n)$ vs. $O(n^2)$).

Our version of the diagonal Mahalanobis distance measure is closely
related to the standardized Euclidean distance defined as the Euclidean distance
between the components divided by their standard deviation: the square of
the standardized Euclidean distance between x and y is $\sum_{i=1}^{n} ((x_i - y_i)/\sigma_i)^2$
where $\sigma_i$ is the standard deviation of the i-th component. However, we must
multiply the square of the standardized Euclidean distance by the geometric
mean of the variances ($\sqrt[n]{\prod_{i=1}^{n} \sigma_i^2}$) to get our diagonal Mahalanobis distance
measure. This normalization is a consequence of our requirement that the
determinant of the matrix M be one: $\det(M) = 1$. It is significant because
we may simultaneously use several distance measures in the class-specific NN
classification.
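
The diagonal Mahalanobis distance measure can be sketched in a few lines of Python. The function names `diagonal_weights` and `diagonal_mahalanobis` are illustrative assumptions; the geometric mean is computed through logarithms to avoid overflow when the time series are long.

```python
import math
from typing import List, Sequence


def diagonal_weights(series: Sequence[Sequence[float]]) -> List[float]:
    """Diagonal of M: geometric mean of the variances divided by each variance."""
    big_n = len(series)
    n = len(series[0])
    variances = []
    for i in range(n):
        mean = sum(s[i] for s in series) / big_n
        variances.append(sum((s[i] - mean) ** 2 for s in series) / (big_n - 1))
    if min(variances) <= 0.0:
        raise ValueError("every data point must have a non-zero variance")
    log_geometric_mean = sum(math.log(v) for v in variances) / n
    # The weights multiply to one, so det(M) = 1.
    return [math.exp(log_geometric_mean - math.log(v)) for v in variances]


def diagonal_mahalanobis(x: Sequence[float], y: Sequence[float],
                         weights: Sequence[float]) -> float:
    """Weighted squared Euclidean distance, computed in linear time."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))


# Variances are 2 and 8, so the geometric mean is 4 and the weights are 2 and 0.5.
weights = diagonal_weights([[0.0, 0.0], [2.0, 4.0]])
distance = diagonal_mahalanobis([1.0, 0.0], [0.0, 2.0], weights)
```

Because each class gets weights whose product is one, distances computed with different class-specific matrices remain comparable.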

Unfortunately, the diagonal Mahalanobis distance measure fails to use the
information off the diagonal in the covariance matrix. See Fig. 1 for the
covariance matrix of a class of time series. It is clear from the figure that the
covariance matrix has significant values off the diagonal. There are even
block-like patterns in the matrix corresponding to specific time intervals.

[Figure: panels (a) Sample of time series and (b) Sample covariance]

Fig. 1: Ten samples of the Cylinder class from the CBF data set (Saito
1994) and its sample covariance. Each time series has 128 data points. Higher
absolute values in the matrix are presented using darker colors.



4.2 Moore-Penrose pseudoinverse and covariance shrinkage 

Could non-diagonal Mahalanobis distance measures be superior to, or at
least competitive with, the diagonal Mahalanobis distance? It is
tempting to use banded matrices, but the restriction of a positive definite
matrix to a band may fail to be positive definite. Block-diagonal matrices (Matton
et al 2010) can preserve positive definiteness, but learning which blocks to use
in the context of time series might be difficult. Instead, we propose two
approaches: one is based on the widely used Moore-Penrose pseudoinverse, and
the other is covariance shrinkage. See Fig. 2 for the three different covariance
estimates of the same class: sample covariance, shrunk covariance and
diagonal covariance.

The approach based on the pseudoinverse relies on the singular value
decomposition (SVD). We write the SVD as $C = U \Sigma V^T$ where $\Sigma$ is a diagonal
matrix with eigenvalues $\gamma_1, \gamma_2, \ldots$ and U and V are orthogonal matrices. The
Moore-Penrose pseudoinverse is given by $V \Sigma^+ U^T$ where $\Sigma^+$ is the diagonal
matrix made of the eigenvalues $1/\gamma_1, 1/\gamma_2, \ldots$ with the convention that 1/0 =
0. The pseudo-determinant is the product of the non-zero eigenvalues of $\Sigma$.
We set M equal to the pseudoinverse of the covariance matrix, normalized
so that it has a pseudo-determinant of one. This solution is equivalent to
projecting the time series data on the subspace corresponding to the non-zero
eigenvalues of $\Sigma$. That is, the matrix M is singular. Since the matrix M is
still a positive semi-definite matrix, the square root of the generalized ellipsoid
distance measure remains a pseudometric: it is symmetric, non-negative and it
satisfies the triangle inequality. But it is no longer a metric since it is possible
to find distinct x, y such that $D_M(x, y) = 0$.
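
To make the loss of the metric property concrete, consider a hypothetical 2 × 2 covariance matrix of rank one. The Python sketch below hard-codes its pseudoinverse, which we worked out by hand from the eigenvalues 2 and 0, instead of calling a numerical library; all names are illustrative.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def ellipsoid_distance(x: Sequence[float], y: Sequence[float], m: Matrix) -> float:
    d = [u - v for u, v in zip(x, y)]
    return sum(d[i] * m[i][j] * d[j] for i in range(len(d)) for j in range(len(d)))


# Rank-one covariance: eigenvalue 2 along (1, 1) and eigenvalue 0 along (1, -1).
C = [[1.0, 1.0], [1.0, 1.0]]
# Pseudoinverse: invert only the non-zero eigenvalue (1/2), keep 1/0 = 0.
C_pinv = [[0.25, 0.25], [0.25, 0.25]]
# Rescale so that the only non-zero eigenvalue of M is 1 (pseudo-determinant one).
M = [[0.5, 0.5], [0.5, 0.5]]

# Two of the Moore-Penrose conditions: C C+ C = C and C+ C C+ = C+.
assert matmul(matmul(C, C_pinv), C) == C
assert matmul(matmul(C_pinv, C), C_pinv) == C_pinv

# Distinct series at distance zero: M ignores any difference along (1, -1).
zero_distance = ellipsoid_distance([1.0, 0.0], [0.0, 1.0], M)
positive_distance = ellipsoid_distance([1.0, 1.0], [0.0, 0.0], M)
```

Here the series (1, 0) and (0, 1) are distinct, yet their distance is zero, so the measure is only a pseudometric.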
[Figure: panels (a) Sample covariance, (b) Shrunk covariance, (c) Diagonal covariance]

Fig. 2: The covariance estimates of the Funnel class in the CBF data set.
Large absolute values are in darker colors. Both the shrunk and diagonal
covariances are positive definite whereas the sample covariance matrix is
singular.



Covariance shrinkage is an estimation method for problems with a small
number of instances and a large number of attributes (Stein 1956). It has
better theoretical and practical properties for such data sets as the estimated
covariance matrix is guaranteed to be non-singular. The covariance matrix C
is positive semi-definite but can be singular. To prevent C from being singular,
we replace it with an estimate of the form

    $C^* = \lambda T + (1 - \lambda) C$

for some suitably chosen target matrix T: if T is a positive definite matrix and
$\lambda \in (0, 1]$, then $\lambda T + (1 - \lambda) C$ must be positive definite. Moreover, the
smallest eigenvalue of $\lambda T + (1 - \lambda) C$ must be at least as large as $\lambda$ times the
smallest eigenvalue of T. We have used the target recommended by Schafer
and Strimmer (2005) which is the diagonal of the unrestricted covariance
estimate, $T = \mathrm{diag}(C)$. It is positive definite in our examples. For $\lambda$, we use
the estimate proposed by Schafer and Strimmer (2005) (see Appendix A for
details). We then set $M^{-1} \propto C^*$, normalizing so that $\det(M) = 1$. Unlike the
pseudoinverse approach, covariance shrinkage generates a generalized ellipsoid
distance measure which is a metric.
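
With the diagonal target T = diag(C), the shrinkage estimate keeps the diagonal of C and multiplies every off-diagonal entry by one minus the shrinkage parameter. A minimal Python sketch follows; the function name `shrink_covariance` and the parameter name `lam` are illustrative assumptions, and the value 0.5 is chosen only for the example.

```python
from typing import List, Sequence


def shrink_covariance(cov: Sequence[Sequence[float]], lam: float) -> List[List[float]]:
    """Return lam * diag(C) + (1 - lam) * C for a shrinkage parameter in (0, 1]."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("the shrinkage parameter must be in (0, 1]")
    n = len(cov)
    # On the diagonal: lam * c_ii + (1 - lam) * c_ii = c_ii.
    # Off the diagonal: lam * 0 + (1 - lam) * c_ij.
    return [
        [cov[i][j] if i == j else (1.0 - lam) * cov[i][j] for j in range(n)]
        for i in range(n)
    ]


# A singular sample covariance (determinant zero) becomes invertible.
singular = [[4.0, 8.0], [8.0, 16.0]]
shrunk = shrink_covariance(singular, 0.5)
determinant = shrunk[0][0] * shrunk[1][1] - shrunk[0][1] * shrunk[1][0]
```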



4.3 Global and class-specific distance measures 

Given a training data set of time series, we can learn a single Mahalanobis
distance measure from all time series, irrespective of their class labels
(henceforth global Mahalanobis distance measure). We have a single matrix M. In
this case the 1-NN classification algorithm works as follows: given a candidate
time series x we seek y from the training set such that $D_M(x, y)$ is minimal.
We then classify x in the same class as y.

Otherwise, we may learn one Mahalanobis distance measure per class of
time series. In this class-specific approach, the covariance matrix is computed
solely from the time series of one class. Hence, for each class, we get one
distance measure: class i gets distance measure $D_{M_i}$. We have one matrix ($M_i$)
per class (i). Given y from the training set, let c(y) be the class of y. Given
a candidate time series x, we classify it by finding y such that its distance
to x, $D_{M_{c(y)}}(x, y)$, is minimal. Hence, we not only compare the candidate
time series x with time series from different classes, but we also use different
distance measures.
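
The class-specific 1-NN rule can be sketched as follows for the diagonal case. This is a minimal Python illustration: the function name `classify_1nn`, the data layout and the hand-picked weights are hypothetical, and each weight vector stands for the diagonal of the matrix of one class.

```python
from typing import Dict, Hashable, Sequence, Tuple


def classify_1nn(
    x: Sequence[float],
    training: Sequence[Tuple[Sequence[float], Hashable]],
    class_weights: Dict[Hashable, Sequence[float]],
) -> Hashable:
    """Label of the training series y minimizing the distance learned for c(y)."""
    best_label, best_distance = None, float("inf")
    for y, label in training:
        weights = class_weights[label]  # diagonal of the matrix for class c(y)
        distance = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


training = [([0.0, 0.0], "a"), ([4.0, 0.0], "b")]
# Class "a" varies little along the first point, class "b" varies a lot.
specific = {"a": [4.0, 0.25], "b": [0.25, 4.0]}
# Using the identity for every class amounts to the squared Euclidean distance.
euclidean = {"a": [1.0, 1.0], "b": [1.0, 1.0]}

with_specific = classify_1nn([1.5, 0.0], training, specific)
with_euclidean = classify_1nn([1.5, 0.0], training, euclidean)
```

In this toy example the candidate is closer to the instance of class "a" in the Euclidean sense, but the class-specific measures assign it to class "b".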

Thus, finally, we consider six types of Mahalanobis distance measures for 
1-NN classification: two localities (global or class-specific) and three estimators 
(pseudoinverse, shrinkage, or diagonal). 



5 Experiments 



The main goal of our experiments is to evaluate Mahalanobis distance mea- 
sures and the class-specific approach on time series. More specifically, we ask 
the following questions. 

- Of all the possible applications of the Mahalanobis distance measures
  (pseudoinverses, shrinkage or diagonal; class-specific or global), which one
  offers the best 1-NN classification accuracy? (§5.2)

- How do Mahalanobis distance measures compare with state-of-the-art
  alternatives such as DTW or LMNN? (§5.3)

- One of the simplest and most common distance measures, the Euclidean
  distance, is sometimes difficult to surpass for 1-NN classification of time
  series. To assess this effect, we ask how the relative accuracy of the
  Mahalanobis distance measure changes as we increase the number of instances
  per class in the training set. (§5.4)



We begin all tests with a training data set comprising several classes of 
time series. When applicable, distance measures are learned from this data 
set. We then attempt to classify some test data using 1-NN. We define the 
classification error to be the percentage of misclassified instances whereas the 
accuracy is the percentage of properly classified instances. 



The code for the experiments is available online (Prekopcsak 2011) with
instructions on how the results can be reproduced. For LMNN, we use the source
code provided by Weinberger and Saul (2008) for the experiments with default
parameters. For the DTW, we find the best monotonic matching $\Gamma$
minimizing $\sum_{(i,j) \in \Gamma} |x_i - y_j|^2$. The computational cost of the DTW is sometimes a
challenge (Salvador and Chan 2007). To alleviate this problem, several
strategies have been proposed including lower bounds and R*-tree indexes (Lemire
2009; Ouyang and Zhang 2010; Ratanamahatana and Keogh 2005). For our
purposes, we use a quadratic-time dynamic programming algorithm. In
contrast, the Euclidean and diagonal Mahalanobis distance measures only require
linear time. We ran the experiments on a MacBook Pro laptop with a 2.3 GHz
Intel Core i5 processor and 8 GB of RAM. All code ran on Matlab R2011a.
Table 1: Number of classes, number of instances in both the training and
testing sets, and the length of the time series in each data set.

Data set            classes   training set   testing set   length (n)
50 words                 50            450           455          270
Adiac                    37            390           391          176
CBF                       3             30           900          128
ECG                       2            100           100           96
Fish                      7            175           175          463
Face (all)               14            560          1690          131
Face (four)               4             24            88          350
Gun-Point                 2             50           150          150
Lighting-2                2             60            61          637
Lighting-7                7             70            73          319
OSU Leaf                  6            200           242          427
OliveOil                  4             30            30          570
Swedish Leaf             15            500           625          128
Trace                     4            100           100          275
Two Patterns              4           1000          4000          128
Synthetic Control         6            300           300           60
Yoga                      2            300          3000          426



5.1 Data sets 



We use the UCR time series classification benchmark (Keogh et al 2006) for
our experiments as it includes diverse time series data sets from many
domains. It has predefined training-test splits for the experiments (see Table 1),
so the results can be compared across different papers. Most of the data sets
are z-normalized: that is, the time series have zero mean and a variance of
one. We removed the two data sets that are not z-normalized by default (Beef and
Coffee). Indeed, z-normalization substantially improves the classification
accuracy, irrespective of the chosen distance measure. Thus, for fair results,
we should z-normalize them, but this may create confusion with previously
reported numbers. We also removed the Wafer data set as all distance
measures classify it nearly perfectly. The remaining 17 data sets were used for the
comparison of different methods.



5.2 Best Mahalanobis distance measure for 1-NN accuracy 

We compare the various Mahalanobis distance measures in Table 2. We have
left out the Moore-Penrose pseudoinverse, because its error rates were twice as
high on average compared to the other variants. What is immediately apparent
is that the class-specific measures give better classification results.

The diagonal Mahalanobis has a smaller classification error and is
considerably faster (3.7 min compared to 5.5 min on the whole data set), but
the shrinkage estimate yields significantly better results for several data sets
(e.g. Adiac and Fish). Thus, out of the six variations, we recommend the
class-specific covariance-shrinkage estimate and the class-specific diagonal
Mahalanobis distance measures.

Table 2: Classification error for the various Mahalanobis distance measures.

Data set            Shrinkage                  Diagonal
                    global   class-specific    global   class-specific
50 words            0.34     0.71              0.34     0.32
Adiac               0.33     0.30              0.37     0.36
CBF                 0.34     0.06              0.16     0.05
ECG                 0.12     0.10              0.10     0.08
Fish                0.31     0.15              0.19     0.18
Face (all)          0.32     0.27              0.32     0.25
Face (four)         0.27     0.16              0.16     0.17
Gun-Point           0.12     0.14              0.10     0.11
Lighting-2          0.31     0.30              0.25     0.25
Lighting-7          0.55     0.32              0.36     0.23
OSU Leaf            0.68     0.69              0.46     0.46
OliveOil            0.17     0.20              0.17     0.13
Swedish Leaf        0.24     0.15              0.21     0.18
Trace               0.40     0.12              0.21     0.07
Two Patterns        0.12     0.12              0.12     0.12
Synthetic Control   0.23     0.08              0.13     0.09
Yoga                0.26     0.22              0.17     0.17

# of best errors: 5, 5



5.3 Comparing competitive distance measures 

How do the class-specific Mahalanobis distance measures behave in comparison
with competitive distance measures? Computationally, the diagonal
Mahalanobis is inexpensive compared to schemes such as the DTW or LMNN.
Regarding the 1-NN classification error rate, we give the results in Table 3.

As expected (Ding et al 2008), no distance measure is better on all data sets.
However, because the diagonal Mahalanobis distance measure is closely related 
to the Euclidean distance, we compare their classification accuracy. In two 
data sets, the Euclidean distance outperformed the class-specific Mahalanobis 
distance measures and only by small differences (0.09 versus 0.10-0.12). Mean- 
while, the class-specific diagonal Mahalanobis distance measures outperformed 
the Euclidean distance 12 times, and sometimes by large margins (0.07 versus 
0.24 and 0.05 versus 0.15). The LMNN is also competitive: its classification 
error is sometimes half that of the Euclidean distance. 

The DTW has the lowest error rates and provides best results for half 
of the data sets, but it is much slower than Mahalanobis distance measures. 
It takes 3.7 min (diagonal) and 5.5 min (covariance shrinkage) to compute 
the Mahalanobis results on the whole data set. As expected, the diagonal 
Mahalanobis is nearly as fast as the Euclidean distance (3.5 min). The LMNN 
takes 18 min and the DTW runs for 18 hours. The DTW is at least two orders 
of magnitude slower than the diagonal Mahalanobis on all 17 data sets. 
Table 3: Classification errors for some competitive schemes. For all distance
measures, we use 1-NN classification. For the 50 words data set, the LMNN
computation fails because it has a class with only one instance. For this table,
we used the class-specific Mahalanobis distance measure.

Data set            Euclidean   DTW     Mahalanobis         LMNN
                                        shrink.   diag.
50 words            ?           ?       0.71      0.32      -
Adiac               0.39        0.40    0.30      0.36      0.23
CBF                 0.15        0.00    0.06      0.05      0.15
ECG                 0.12        0.23    0.10      0.08      0.10
Fish                0.22        0.17    0.15      0.18      0.13
Face (all)          0.29        0.19    0.27      0.25      0.16
Face (four)         0.22        0.17    0.16      0.17      0.16
Gun-Point           0.09        0.09    0.14      0.11      0.05
Lighting-2          0.25        0.13    0.30      0.25      0.11
Lighting-7          0.42        0.27    0.32      0.23      0.51
OSU Leaf            0.48        0.41    0.69      0.46      0.57
OliveOil            0.13        0.13    0.20      0.13      0.13
Swedish Leaf        0.21        0.21    0.15      0.18      0.21
Trace               0.21        0.00    0.12      0.07      0.20
Two Patterns        0.09        0.00    0.12      0.12      0.05
Synthetic Control   0.12        0.01    0.08      0.09      0.03
Yoga                0.17        0.16    0.22      0.17      0.18
# of best errors    1           9       2         3         6



5.4 Effect of the number of instances per class 

Whereas Table 3 shows that the Mahalanobis distance measures are far
superior to the Euclidean distance on some data sets, this result is linked to the
number of instances per class. For example, on the Wafer data set (which we
removed), there are many instances per class (500), and correspondingly, all
distance measures give a negligible classification error.

Thus, we considered three different synthetic time-series data-set genera- 
tors with varying numbers of instances per class: 



- Cylinder-Bell-Funnel (CBF) (Saito 1994),

- Control Charts (CC) (Pham and Chan 1998) and

- Waveform (Breiman 1998).

The CC data is made of 6 classes containing time series made of 60 data points;
CBF is made of 3 classes and its time series have 128 data points; Waveform
has 3 classes and its time series are made of 21 data points. All time series
are z-normalized (zero mean and a variance of one). The CBF data set from
Tables 2 and 3 was generated from the same data-set model, except that we
vary the number of time series (see Appendix B).

Test sets have 1 000 instances per class whereas training sets have between
10 and 1 000 instances. We repeated each test ten times, with different training
sets. Fig. 3 shows that whereas the class-specific diagonal Mahalanobis distance
measures are superior to the Euclidean distance when there are few instances,
this benefit is less significant as the number of instances increases. Indeed, the
classification accuracy of the Euclidean distance grows closer to perfection and
it becomes more difficult for alternatives to be far superior.

[Figure: horizontal axis "Number of instances per class" with ticks at 10, 100 and 1000]

Fig. 3: Ratios of the 1-NN classification accuracies using the class-specific
diagonal Mahalanobis and Euclidean distance measures



6 Conclusion 

The Mahalanobis distance measures have received little attention for time se- 
ries classification and we are not surprised given their poor performance as a 
1-NN classifier when used in a straight-forward manner. However, by learning 
one Mahalanobis distance measure per class we get a competitive classifier 
when using either covariance shrinkage or a diagonal approach. Moreover, the 
diagonal Mahalanobis distance measure is particularly appealing computation- 
ally: we only need to compute the variances of the components. Meanwhile, 
we get good results with the LMNN on time series data, though it is more 
expensive. The DTW is superior, but it is one to two orders of magnitude 
slower. 

Acknowledgements This work is supported by NSERC grant 261437. 
A Choice of the parameter $\lambda$ in covariance shrinkage

Covariance shrinkage (see §4) requires the choice of a parameter $\lambda \in (0, 1]$, which should
be sufficiently large so that $\lambda T + (1 - \lambda) C$ is numerically invertible. We choose T to be the
diagonal of the covariance matrix C ($T = \mathrm{diag}(C)$).

To give the formula for $\lambda$ proposed by Schafer and Strimmer (2005), we need to introduce
some technical notation. Given a family of time series $x^{(1)}, x^{(2)}, \ldots, x^{(N)}$, we write the
average of the i-th component as $\bar{x}_i = \frac{1}{N} \sum_{k=1}^{N} x_i^{(k)}$. We write

    $w_{kij} = (x_i^{(k)} - \bar{x}_i)(x_j^{(k)} - \bar{x}_j)$

and $\bar{w}_{ij} = \frac{1}{N} \sum_{k=1}^{N} w_{kij}$. Moreover, we write

    $\widehat{\mathrm{Var}}(c_{ij}) = \frac{N}{(N-1)^3} \sum_{k=1}^{N} (w_{kij} - \bar{w}_{ij})^2.$

Finally, we have

    $\lambda^* = \frac{\sum_{i \neq j} \widehat{\mathrm{Var}}(c_{ij})}{\sum_{i \neq j} c_{ij}^2}$

where $c_{ij}$ are the components of the (sample) covariance matrix C (see §3). We set $\lambda$ to $\lambda^*$
when $\lambda^* < 1$. Otherwise, we set $\lambda = 1$.

We experimented informally with different values of $\lambda$ (using the data sets from
§5) and found that the choice recommended by Schafer and Strimmer (2005) was reasonable.
That is, we did not find a case where a different value of $\lambda$ gave much better classification
accuracy.
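
The estimate can be sketched in Python as follows. This sketch assumes the variance estimate of Schafer and Strimmer (2005) as we understand it, namely N/(N-1)^3 times the sum over k of the squared deviations of w_kij from its mean; the function name `shrinkage_intensity` and the fallback to 1 when all off-diagonal covariances vanish are our own illustrative choices.

```python
from typing import Sequence


def shrinkage_intensity(series: Sequence[Sequence[float]]) -> float:
    """Estimate the shrinkage parameter for the target T = diag(C), capped at 1."""
    big_n = len(series)
    if big_n < 2:
        raise ValueError("at least two time series are required")
    n = len(series[0])
    means = [sum(s[i] for s in series) / big_n for i in range(n)]
    numerator = 0.0    # sum over i != j of the estimated variance of c_ij
    denominator = 0.0  # sum over i != j of c_ij squared
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = [(s[i] - means[i]) * (s[j] - means[j]) for s in series]
            w_bar = sum(w) / big_n
            c_ij = big_n / (big_n - 1) * w_bar
            var_c_ij = big_n / (big_n - 1) ** 3 * sum((wk - w_bar) ** 2 for wk in w)
            numerator += var_c_ij
            denominator += c_ij ** 2
    if denominator == 0.0:
        # Assumption: with no off-diagonal covariance, shrink fully to the target.
        return 1.0
    return min(1.0, numerator / denominator)


# Worked by hand: each off-diagonal variance estimate is 16 and c_12 = 8,
# so the estimate is (16 + 16) / (64 + 64) = 0.25.
lam = shrinkage_intensity([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
```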



B The Cylinder-Bell-Funnel (CBF) data model 



Consider the original CBF data model (Saito 1994). We can use it to generate time series
of three possible classes. In the case where we have only 10 time series for each class in the
training data set and a large number of time series in the test set (1000), we find that, over
ten tests, the average 1-NN classification error rate is 0.20 ($\sigma = 0.04$) for the Euclidean
distance and 0.15 ($\sigma = 0.03$) for the diagonal Mahalanobis distance measure. These results
are difficult to reconcile with Table 3 where we used a similar number of CBF time series
provided by Keogh et al (2006) and where we report error rates of 0.15 and 0.05. Indeed, the
difference in error rate for the diagonal Mahalanobis distance measure exceeds 3 standard
deviations.

After inspection, we found that the CBF data model used by Keogh et al (2006) differs
from the original presentation by Saito (1994). They both generate time series using random
functions of the form: $c(i) = (6 + \eta) \cdot \chi_{[a,b]}(i) + \epsilon(i)$, $b(i) = (6 + \eta) \cdot \chi_{[a,b]}(i) \cdot (i - a)/(b - a) + \epsilon(i)$
and $f(i) = (6 + \eta) \cdot \chi_{[a,b]}(i) \cdot (b - i)/(b - a) + \epsilon(i)$ where $i = 1, \ldots, 128$ and $\chi_{[a,b]}$ is the
characteristic function. They both use standard normal variates for $\eta$ and $\epsilon(i)$, and uniformly
distributed integer values for a in [16, 32]. However, whereas Saito (1994) states that $b - a$ obeys
an integer-valued uniform distribution on [32, 96], we found that Keogh et al (2006) generated
their CBF data so that $b - 32$ is an integer-valued uniform distribution on [32, 96].

If we adopt the Keogh et al (2006) variation, the classification error rates go down: 0.16
($\sigma = 0.03$) for the Euclidean distance and 0.10 ($\sigma = 0.04$) for the diagonal Mahalanobis
distance measure. These results are nearly within a standard deviation of the results presented
in Table 3 for CBF.
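
Both variants of the generator can be sketched in Python as follows. This is an illustration under the assumptions stated in this appendix; the function name `cbf_series`, the `keogh_variant` flag and the use of Python's `random` module are our own choices and do not reproduce the original generators' code.

```python
import random
from typing import List


def cbf_series(kind: str, rng: random.Random, keogh_variant: bool = False,
               length: int = 128) -> List[float]:
    """Generate one cylinder, bell or funnel time series."""
    a = rng.randint(16, 32)  # uniformly distributed integer in [16, 32]
    if keogh_variant:
        b = 32 + rng.randint(32, 96)  # b - 32 uniform on [32, 96]
    else:
        b = a + rng.randint(32, 96)   # b - a uniform on [32, 96]
    eta = rng.gauss(0.0, 1.0)  # one standard normal variate per series
    values = []
    for i in range(1, length + 1):
        inside = 1.0 if a <= i <= b else 0.0  # characteristic function of [a, b]
        if kind == "cylinder":
            shape = 1.0
        elif kind == "bell":
            shape = (i - a) / (b - a)
        elif kind == "funnel":
            shape = (b - i) / (b - a)
        else:
            raise ValueError("kind must be 'cylinder', 'bell' or 'funnel'")
        # Independent standard normal noise is added at every data point.
        values.append((6.0 + eta) * inside * shape + rng.gauss(0.0, 1.0))
    return values


rng = random.Random(0)  # fixed seed so that the example is repeatable
bell = cbf_series("bell", rng)
funnel_keogh = cbf_series("funnel", rng, keogh_variant=True)
```

In the variant, b no longer depends on a, so the plateau tends to end later and its length varies differently, which may explain the change in error rates.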



